
    Two-Stream Convolutional Networks for Action Recognition in Videos

    We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to generalise the best performing hand-crafted features within a data-driven learning framework. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multi-task learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video action benchmarks of UCF-101 and HMDB-51, where it is competitive with the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.
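    A minimal PyTorch sketch of the two-stream idea, assuming the per-stream softmax scores are fused by averaging (the paper also reports SVM-based fusion); the layer sizes are illustrative rather than the exact configuration used in the paper. The spatial stream classifies a single RGB frame, while the temporal stream classifies a stack of horizontal/vertical optical-flow fields.

```python
import torch
import torch.nn as nn

class StreamConvNet(nn.Module):
    """One stream: a small ConvNet over an input with `in_channels` channels."""
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 96, kernel_size=7, stride=2), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

class TwoStreamNet(nn.Module):
    """Spatial stream sees one RGB frame (3 channels); temporal stream sees
    L stacked optical-flow fields (2L channels: horizontal + vertical).
    The per-stream softmax scores are fused by averaging."""
    def __init__(self, num_classes: int = 101, flow_len: int = 10):
        super().__init__()
        self.spatial = StreamConvNet(3, num_classes)               # appearance
        self.temporal = StreamConvNet(2 * flow_len, num_classes)   # motion

    def forward(self, rgb, flow):
        p_spatial = self.spatial(rgb).softmax(dim=1)
        p_temporal = self.temporal(flow).softmax(dim=1)
        return (p_spatial + p_temporal) / 2   # late fusion by averaging

# One RGB frame plus 10 stacked (u, v) flow fields per clip:
model = TwoStreamNet(num_classes=101, flow_len=10)
scores = model(torch.randn(4, 3, 224, 224), torch.randn(4, 20, 224, 224))
```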

    Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps

    This paper addresses the visualisation of image classification models, learnt using deep Convolutional Networks (ConvNets). We consider two visualisation techniques, based on computing the gradient of the class score with respect to the input image. The first generates an image which maximises the class score [Erhan et al., 2009], thus visualising the notion of the class captured by a ConvNet. The second technique computes a class saliency map, specific to a given image and class. We show that such maps can be employed for weakly supervised object segmentation using classification ConvNets. Finally, we establish the connection between the gradient-based ConvNet visualisation methods and deconvolutional networks [Zeiler et al., 2013].
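    The saliency technique amounts to a single backward pass. A minimal PyTorch sketch, assuming any differentiable image classifier in place of the paper's ConvNet: backpropagate the raw (pre-softmax) class score to the input pixels and take the maximum absolute gradient over colour channels.

```python
import torch

def class_saliency(model, image, target_class):
    """Image-specific class saliency map: |d(score_c)/d(pixel)|, reduced to
    a single (H, W) map by taking the max over the 3 colour channels."""
    model.eval()
    image = image.clone().requires_grad_(True)
    score = model(image.unsqueeze(0))[0, target_class]  # raw class score
    score.backward()
    return image.grad.abs().amax(dim=0)  # (3, H, W) -> (H, W)

# Usage with any differentiable image classifier, e.g. a torchvision model:
# saliency = class_saliency(model, image_tensor, target_class=243)
```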

    EmbraceNet for Activity: A Deep Multimodal Fusion Architecture for Activity Recognition

    Human activity recognition using multiple sensors has been a challenging but promising task over recent decades. In this paper, we propose a deep multimodal fusion model for activity recognition based on the recently proposed feature fusion architecture named EmbraceNet. Our model processes each sensor's data independently, combines the features with the EmbraceNet architecture, and post-processes the fused feature to predict the activity. In addition, we propose additional processes to boost the performance of our model. We submit the results obtained from our proposed model to the SHL recognition challenge with the team name "Yonsei-MCML".
    Comment: Accepted at HASCA (ACM UbiComp/ISWC 2019); won 2nd place in the SHL Recognition Challenge 2019.
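    A simplified PyTorch sketch of EmbraceNet-style fusion as described above; the sensor names and feature sizes are illustrative. Each sensor's feature vector is projected ("docked") to a common size, and for every output dimension exactly one modality is sampled to contribute, so no single slot of the fused feature depends on all sensors at once.

```python
import torch
import torch.nn as nn

class EmbraceFusion(nn.Module):
    """Simplified EmbraceNet-style fusion: dock each modality to a common
    size, then sample one contributing modality per output dimension."""
    def __init__(self, input_dims, embrace_dim=256):
        super().__init__()
        self.docking = nn.ModuleList(nn.Linear(d, embrace_dim) for d in input_dims)

    def forward(self, features, probs=None):
        # features: list of (batch, d_i) tensors, one per sensor
        docked = torch.stack([dock(f) for dock, f in zip(self.docking, features)])
        m, b, c = docked.shape
        if probs is None:                       # equal modality probabilities
            probs = torch.full((m,), 1.0 / m)
        # Draw a modality index for every (batch, dimension) entry ...
        idx = torch.multinomial(probs.expand(b * c, m), 1).view(b, c)
        # ... and keep only the selected modality's value in each slot.
        mask = nn.functional.one_hot(idx, m).permute(2, 0, 1).to(docked.dtype)
        return (docked * mask).sum(dim=0)       # (batch, embrace_dim)

# E.g. accelerometer (64-d), gyroscope (64-d) and magnetometer (32-d) features:
fusion = EmbraceFusion([64, 64, 32])
fused = fusion([torch.randn(8, 64), torch.randn(8, 64), torch.randn(8, 32)])
```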

    Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition

    In this work we present a framework for the recognition of natural scene text. Our framework does not require any human-labelled data, and performs word recognition on the whole image holistically, departing from the character-based recognition systems of the past. The deep neural network models at the centre of this framework are trained solely on data produced by a synthetic text generation engine -- synthetic data that is highly realistic and sufficient to replace real data, giving us infinite amounts of training data. This excess of data exposes new possibilities for word recognition models, and here we consider three models, each one "reading" words in a different way: via 90k-way dictionary encoding, character sequence encoding, and bag-of-N-grams encoding. In the scenarios of language-based and completely unconstrained text recognition we greatly improve upon state-of-the-art performance on standard datasets, using our fast, simple machinery and requiring zero data-acquisition costs.
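    A miniature PyTorch sketch of the first model (dictionary encoding): word recognition becomes one very wide classification problem with one output per lexicon word (~90k in the paper), so the whole word image is read holistically with no character segmentation. All layer sizes here are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DictNet(nn.Module):
    """Dictionary-encoding reader in miniature: classify a whole word image
    into one of `vocab_size` dictionary words."""
    def __init__(self, vocab_size: int = 90_000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 13)),   # word images are wide, e.g. 32x100
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 4 * 13, 4096), nn.ReLU(),
            nn.Linear(4096, vocab_size),     # one logit per dictionary word
        )

    def forward(self, x):                    # x: (batch, 1, 32, 100) grey images
        return self.classifier(self.features(x))

# Training pairs come entirely from a synthetic rendering engine; at test time
# the arg-max index is looked up in the 90k-word dictionary:
logits = DictNet()(torch.randn(2, 1, 32, 100))
word_idx = logits.argmax(dim=1)
```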